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ABSTRACT. A novel adaptive Markov chain Monte Carlo algorithm is presen¬ 
ted. The algorithm utilizes sparisity in the partial correlation structure of a 
density to efficiently estimate the covariance matrix through the Cholesky factor 
of the precision matrix. The algorithm also utilizes the sparsity to sample effi¬ 
ciently from both MALA and Metropolis Hasting random walk proposals. Fur¬ 
ther, an algorithm that estimates the partial correlation structure of a density is 
proposed. Combining this with the Cholesky factor estimation algorithm results 
in an efficient black-box AMCMC method that can be used for general densit¬ 
ies with unknown dependency structure. The method is compared with regular 
empirical covariance adaption for two examples. In both examples, the proposed 
method’s covariance estimates converge faster to the true covariance matrix and 
the computational cost for each iteration is lower. 

Key words: MHRW, MALA, AMCMC, Online estimation, Cholesky estimation. Partial 
correlation 


1 Introduction 


Markov chain Monte Carlo (MCMC) algorithms are widely used for sampling from complicated 
distributions. As the dimension of the distribution gets larger, tuning the parameters of the 
algorithms becomes both more important and more difficult. Adaptive MCMC (AMCMC) 
methods address this issue by tuning the MCMC algorithm while it is running. 

One of the most popular MCMC algorithms is the Metropolis Hasting random walk (MHRW) 
algorithm ( [Tierne ^ [ToMl ). In the MHRW, new samples are proposed using the proposal ker¬ 
nel Q(x,.) = N(0,X1), where Xl is a scaling matrix and N denotes the normal distribution. 
The optimal scaling matrix XI is a rescaled version of the covariance matrix of tt. One of the 
main reasons for the popularity of the algorithm is that it is easy to implement; however, even 
the optimal scaling of the covariance matrix decreases as 0{n~^) where n is the dimension 
of X (Roberts and Rosenthal, 2001). This means that optimal proposals are small for large 


dimensions, which causes poor mixing of the MCMC chain. 

An alternative to the MHRW algorithm is the Metropolis adjusted Langevian algorithm 


(MALA) (Grenander and Miller, 1994). The MALA is generally more difficult to implement 
compared with the MHRW, as it proposal kernel is Q(x,.) = N(yV log(7r(x)), XI). The advant¬ 
age is that the optimal scaling matrix decreases with 0(n“^/^), which is a big improvement 
compared with the MHRW scaling. 

Using the covariance matrix, of tt, for the proposal has a huge effect on the convergence of 
the algorithms in practice. Unfortunately, the matrix is almost never known in advance. In 
the AMCMC framework one solves this by replacing the covariance matrix with the empirical 


covariance matrix (ECM) from previous samples in the MCMC algorithm (Haario et al., 2001). 
If the dimension of tt is large, using the ECM is problematic: It converges very slowly towards 


the true covariance matrix (Roberts and Rosenthal, 2009) and the evaluation of the proposal 


density requires computing the inverse of the ECM. In large dimensionon, the inverse will often 
be the computational bottle neck of MH algorithm. 

The computation of the inverse is so computationally expensive that it often dominates the 
computational cost for each iteration of the MCMC algorithm. 
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In this paper, we propose an AMCMC algorithm that reduces these two issues for a large 
class of high dimensional densities. The general idea is to estimate the Cholesky factor of the 
precision (inverse covariance) matrix instead of using the ECM directly. The precision matrix 
and its Cholesky factor will be sparse if the target density has a sparse partial correlation 
structure (Lauritzen, 1996; Rue and Held, 2005). If the Cholesky factor is sparse there are 


fewer elements to estimate than in the covariance matrix, resulting in faster convergence. Also, 
the evaluation of the proposal density becomes significantly faster by using the sparse Cholesky 
factor. 

The algorithm is especially well suited for Bayesian hierarchical models since they are 
often constructed by a directed acycle graph (DAG), which defines a conditional dependency 
structure of the model. 

A potential issue with the method is that it cannot be used if the partial correlation 
structure is unknown. To remedy this, we propose a second algorithm which estimates the 
conditional density structure, and uses this to define the sparsity of the Cholesky factor. 
Although partial correlation does not imply conditional independence and vice versa, it is 
reasonable to use this as a sparsity pattern for the proposal distribution. Combining this 
algorithm with the Cholesky estimation algorithm results in an efficient black-box AMCMC 
method that can be used on general densities with unknown dependency structure. 

The article is composed as follows. In Section 2, regular covariance-based AMCMC for 
MHRW and MALA is presented and the new adaption method is introduced. In Section 3, 
the algorithm that updates the Cholesky factor in an MCMC iteration is presented. The 
method is applied to two problems in Section 4. In both problems, the densities are derived 
from discretizations of infinite dimensional models, and has a lot of conditional dependency 
structure. Finally, Section 5 contains a brief discussion of future work. 

The code for the method and the examples are available at (Wallin and Bolin, 2015). 


2 Adapation of Metropolis Hastings algorithms 


The c l assica l adaptive MHRW for an n-dimensional target distribution tt, introduced by[H 


aario 


et al. 


(2001), uses at iteration i the proposal Qi(x,.) = N ^x, Here 






r. 


( 1 ) 


where x^^) = \ previous i samples. Typically the ECM is 

adjusted with le to ensure that the matrix is positive definite. Similar adaptation can be used 
for the MALA and the proposal distribution is in that case 

Qi(x, .) = N (x+y5:Wviog(7r(x)),c7^5:W) , 


where cr is a suitable scaling factor, that often also needs to be adapted. We will refer to 
the method above as covariance adaption. Algorithm display an iteration of MALA with 
covariance adaption. 

A simple way to reduce the computation time of the algorithm is to update the Cholesky 
factor of the ECM directly instead of the covariance matrix at line 15 in Algorithm 

The adaption method we propose uses the same proposals, but instead of estimating the 
covariance matrix it estimates the Cholesky factor of the precision matrix, denoted L. We 
denote this adaption method precision adaption. The proposal for the precision adaptation 
is easily reconstructed from the equations above by replacing with (L^^^(L^^^)^)“^. Al¬ 
gorithm displays an iteration of MALA with precision adaptation. Estimating L is not as 
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Algorithm 1 MALA with covariance-based adaptation 


1 : 

2 : 

3 : 

4 : 

5 : 

6 : 

7 : 

8 : 

9 : 

10 : 


11 

12 

13 

14 

15 

16 
17 


procedure MALA(x, S, x, a, tt, i) 

L ^ Choli^E) 
z ^ N(0,I) 
g ^ Vlog(7r(x)) 

Si ^ L'^g 

X* ■(— + (jLz 

g* ^ Vlog(7r(x*)) 
gf ^ L^g* 

U ^U{0,1) 

a ^ log(7r(x’^)) - log(7r(x)) - y (gn"^g? 

+ - x*)^(g* + g) + y (gz)^g/ 

if log{U) < a then 

X ^ X* 

end if 

^ + ilr (^ - ^) (x - 

return {x, x, Xl} 
end procedure 


straightforward as estimating the covariance matrix with the ECM, and the crucial L-Update 
method at line 14 in the algorithm will be described in the next section. 

When introducing adaptation in the algorithm it is important to ensure that tt remains 
the stationary distribution of the algorithm. In Appendix we adress this by connecting 
convergence of precision adaption to covariance adaption. 


3 Online estimation of the Cholesky factor 


In this section we present a method for online estimation of the Cholesky factor of a density’s 
precision matrix. The method will be formulated so that it can take advantage of sparsity 
properties of the Cholesky factor, which determines the partial correlation structure. 

The section is structured as follows. In Section |3.1[ we derive an algorithm producing an 
online least squares estimate, of the Cholesky factor from i samples from tt, assuming 


that the partial correlation structure of tt is known. After this, in Section |3.2| an algorithm 
for estimating the conditional dependency structure of tt is derived, which can be used if the 
partial correlation structure is not known. 


3.1 Least squares estimation 

Assume that X is a random vector with distribution tt and finite positive definite covariance 
matrix XI. Also assume, without loss of generality, that E(X) = 0. We will in this section 
formulate an estimation method for the Cholesky factor, L, of the precision matrix Q = Xl“^. 
This will be done by first finding a representation of L as a series of regression problems, 
which is a well-known method to estimate covariance matrices for normal distributions (see 
for example Pourahmadi, 2011). The theorem below shows that regressing Xj on Xj+i: 7 v for 
j = 1,...,A can be used to construct L. This result is well-known when X is Normally 
distributed; however, we have not been able to find a proof without the normal assumption so 
we provide one here. 
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Algorithm 2 MALA with precision-based adaptation 


1 : 

2 : 

3 : 

4 : 

5 : 

6 : 

7 : 

8 : 

9 : 


10 

11 

12 

13 

14 

15 

16 


procedure MALA(x, L, x, a, tt, i,...) 
z ^ N(0,I) 
g ^ Vlog(7r(x)) 

El ^ 

X* ^ + (7z^ 

g* ^ Vlog(7r(x*)) 
g; ^ L-ig* 

u ^U{0,1) 

a ^ log(7r(x*)) - log(7r(x)) - y (gn"^g? 

+ - x’^)^(g* + g) + y (g/)^g/ 

if log({7) < a then 

X ^ X* 

end if 

L ^ L-update(x - X,...) 
return {x, x, L,...} 
end procedure 


> see Algorithm for details 


Theorem 3.1. Let X be an N — dimensional random variable with zero mean and positive 
definite eovarianee matrix X. Define T as the upper triangular matrix with unit diagonal 
and j+i-w^Jd+i:^ J = Further, define D as the diagonal 

matrix with Djj = Xjj — Then the Cholesky faetor ofis 

L = and is the solution to 


arg mint E 




( 2 ) 


Proof. Let e = TX and D = C[e]. By construction, the variance of ej is Djj 
^yj+i:W^7+i:Mj+i:A^^-^’d+i:A^* Si^^cc D = C[e] = C[TX] = T^ET, the proof is completed by 
can showing that D diagonal matrix. Assume that D is not a diagonal matrix and that 
j < i is non zero. Thus 

or equivalently, by the definition of e. 


V [x, - (Xi - tt^i^^x,+i.jv) 1 < v[x, - 


However j+i-a^^Lj+I:^ which is the best linear predictor (see 

section 1.2), and thus minimize (§ which is a contradiction. 


Stein 


1999 

— n 


In the case when E is unknown and one has i samples of X, one can get the least square 
(LS) estimate of the Cholesky factor, L, by using the LS solution of (§ for each j. This is 
equivalent to replacing E in the theorem above with an empiri cal estimate E. Using the Delta 


van der Vaart 


2000 


Theorem 


method, one can show that L converges to L as i ^ oc (e.g. 

20 . 8 ). 

In general, L is the same matrix as the Cholesky factor of the inverse of the empirical 
covariance matrix. And in fact, computing L by first computing E and then the cholesky 
factor require less computational effort than the regression method presented above. However, 
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Algorithm 3 Cholesky factor updating 

1: procedure L-update(X, i) 

2: D ^ OtvxAT- 

3: T ^ diag{lixN) 

4 : for j = 1,..., A do 

■y^rsp y- 

5: ^iAjUj),j < r^(AjUj),j H i - 

6: '^Aj,Aj ^ ^^Aj,Aj 1 ~[^) ^ Updating the matrix using Sherman-Morrsion 

7: ^ '^Aj,Aj'^Aj,j 

8 : ^j,j ^ '^3,3 ~ '^Aj,i'^3,Aj 

9 : end for 

10 : L ^ T'^D-^/2 

11: return {L, ,, S(A,ui),i> }iti 

12: end procedure 


if TT has a sparse partial correlation structure, this can be utilized in the regression method 
to more efficiently construct L. Suppose, for example, that Xj and Xj+2:n sire uncorrelated 
given Xj+i. Then it follows that Tjj+2:n = 0, and thus to compute only the 1x1 matrix 
Xj+iy+i needs to be inverted, instead of the N — j-\-lxN — j-\-l matrix Xlj+i:7v,j+i:W- 

Using these ideas we construct Algorithmic that online updates to given a new 

observation X. The list Aj used in the algorithm is a set of indices such that the variable 
Xj is assumed to be partially uncorrelated wit h ^{j ^i:N}\Ai given X^^.. In line 6 of the 
algorithm, Sherman-Morisson formula (Bartlett, 1951) is used to update the lower triangular 
part of the inverse of This reduces the complexity of each iteration in the for-loop 

from 0{\Aj\^) to (9(|Ajp). Thus, the algorithm has a computational complexity bounded by 
0{N uiAXj |Ajp). This complexity should be compared to updating X (or its Cholesky factor) 
which has a complexity of 0{N‘^). 

Note that the algorithm is simple to parallelize since the iterations in the loop can be 
executed independently of each other. 


3.2 Finding the conditional sparsity structure 

So far it has been assumed that we know the partial correlation structure of the target distri¬ 
bution, which is needed to construct the set {Ai}fLi in Algorithmic However, it is often not 
practically feasible to derive the partial correlation structure, which is problematic if we want 
to use Algorithmic Because of this, we will in this section present an algorithm that estimates 
the set {Ai}^^. 

Instead of estimating the partial correlation structure directly, we estimate the conditional 
dependence structure and then use this to construct a sparity pattern {Ai}fLi. Although 
conditional independence does not imply that the variables are partially uncorrelated (see 

for a counter example) its seems reasonable to use the conditional 
other knowledge is available. Here it is also important to stress that 
the adaptation algorithm is valid even without using the correct parital correlation structure. 
In fact, one can use any sets {Ai}fLi in the algorithm and it for instance estimates, in a 
somewhat complicated way, a diagonal covariance matrix if Ai = {0}. Of course, in order to 
get good mixing one should use a dependency structure that contains the elements that are 
strongly correlated in the target density. 

Below, we show how to estimate the dependence structure and then how to create {Ai}fLi 
given the structure. However, we first need to define more precisely what we are estim- 


Wermuth and Cox, 1998 


dependece structure if no 
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ating. To that end, we define variables and Xj to be conditionally independent if 
7r(X^|X_p j}, Xj) = 7r(X^|X_pj}). The conditional dependency structure of tt can be rep¬ 
resented using an undirected graph Q — ({1,..., where {i, j} is in the set of edges, 

if and only if X^ and Xj are conditionally dependent, see [Rue and Held] ( |2005[ ) for more 
details. Thus the goal first goal is to find E. 

Let ei represent the canonical basis vector, that is Cij = 1 for j = i and Cij = 0 otherwise. 
The main idea is that one can estimate E by 




^log7r(x) 

d:x.n 




(3) 


The following proposition proves that E C E. That is, the estimated conditional depend¬ 
ency structure always has at most as many edges as the true dependency structure, but is 
often identical to E. 

Proposition 3.2. Let X have density tt and define E through (§. If {i^j} G E then Xj is 
not eonditionally independent of^i given X_pjp 

Proof. Assume that {i, j} G EnE and that Xj and X^ are conditionally independent. By the 
conditional independence 


log7r(x) = log7r(xj|x_j) + log7r(x_j) 

= log7r(xj|x_{j^j}) + log7r(x_j), 


and thus 


d log 7r(x) d log 7r(x -h 


5Xo 


5Xo 


which contradicts j ^ E. 


□ 


To find {Ai}fL-^ given E^ note that E is equivalent to a zero pattern of a symmetric matrix 
Q. From the non-zero pattern of Q one can construct a non-zero pattern of its Cholesky factor. 
The sparsity pattern of the Cholesky factor is not invariant to the ordering of the vector X. 
For a sparse symmetric matrix one can in general reduce the number of non-zero elements in 
its Cholesky factor by using a clever ordering, and because of this we add a reodering step to 
the algorithm. There are several possible reordering methods one could use in this step, and 
we use an Approximate Minimum Degree (AMD) ( [Davis , 2006| method. Figure [^displays the 
effect of using an AMD reordering for an example presented in the next section. The reodering 
reduced the number of symbolic non-zero elements in L from 9689 to 1380. 

Putting all steps together results in Algorithm It should be noted that the algorithm 
can be modified so that {Aj}^^^ is updated online within an MCMC iteration. Also note that 
numerical round-of errors can cause the statement of Propositionto fail when implementing 
the algorithm numerically, however we have not experienced any problems with this on the 
applications we have tested the method on so far. 


4 Examples 

In this section, we compare the performance of the precision-based adaption method to the 
covariance adaptation method using two examples. 

The purpose of the first example is two-folded: First, to show that the precision adaption 
converges to the correct scaling matrix and to show the rate of convergence. We use a Gaussian 
density because this gives an explicit value of (mixing) performance of a scaling matrix com¬ 
pared to the optimal mixing. The second purpose is to study the behavior of adaption under 
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Algorithm 4 Conditional dependence estimation 

1 

procedure A-update(X, V log tt) 


2 



3 

for i = 1,..., N do 


4 

dX^ Vlog7r(X) 


5 

Vlog7r(X + eO 

> e is a canonical basis vector 

6 

E^EU {{ij} : dX| - dX^- ^ 0} 


7 

end for 


8 

Q ^ symbolicMatrix(({l,..., X}, E) 

> the matrix sparsity pattern 

9 

r^AMD{Q) 

[> r is a reordering 

10 

C ^ Cholesky(Qrr) 

[> the resulting sparsity pattern 

11 

for i = 1,..., X do 


12 

^ {j • A,- 7^ 0} \ {i} 


13 

end for 


14 

return 


15 

end procedure 





Figure 1: Conditional dependency structure for the spline model used in the example in Section 


4.2 The sparsity pattern Q (left), the sparsity pattern C (middle), and the sparsity pattern C 


after reordering the nodes using an AMD reodering (right). The number of non-zero elements 
in the matrices are 1860 (left), 9689 (middle), and 1380 (right) and the size of the matrices 
are 200 x 200. 


an increasing dimension of tt. For this example we display only the results for the MALA since 
the results for the MHRW tells the same story. 

For the second example, we have chosen a density with very sparse conditional dependence, 
but with strong correlation between the parameters. Here we compare adaption for both 
MHRW and MALA, and we also estimate the sparsity using the the conditional dependence 
since we do not know the partial correlation structure. 

In both examples we additionally adapt a scaling factor of the proposal, using the method 
proposed by Roberts and Rosenthal (2009), so that the acceptance rate is 0.234 for MHRW 
and 0.574 for MALA. 


4.1 An example from spatial statistics 

This first example is a latent Gaussian model taken from spatial statistics. Let X(s), s G 
[0,1] X [0,1], be a mean-zero Gaussian Matern fields represented as a solution to a stochastic 
partial differential equation (k? — A)^X(s) = W(s), where a is a shape parameter, k? is 
the parameter controlling the correlation range of X, W is Gaussian white noise, and A is 
the laplacian. The field is discretized using a finite element method (FEM), resulting in an 
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approximation 




Xi 


( 4 ) 


1=1 


where {ipi) are piecewise linear basis functions induced by a triangulation of the domain and 
X = (xi,... ^Xn) is a mean-zero Gaussian Markov random field with sparse precision matrix 
Q. See [Lindgren et aE (2011) for further details of the construction. 

The field is observed under Gaussian measurement noise at 100 locations si,..., Sioo chosen 
at random in the domain, resulting in observations yi — X{si) -h where Si are iid N(0,cr^) 
variables. The resulting posterior distribution for x given the observations is 


7r(x|y) oc exp (y - Ax)"^ (y - Ax) - ix"rQ ^xj, 


( 5 ) 


where the matrix A is a sparse observation matrix that links the observation locations to the 
random weights x through Aij — (pj{si). 

The number of basis functions, n, controls the accuracy of the FEM approximation, and by 


increasing it, we can study how well the adaptation method scales. [Cotter et al.| ( |2013[ ) studied 
MCMC methods for discretizations of continuous functions and pointed out that many MCMC 
methods scales poorly with an increasing number of basis function. It is therefore interesting 
to compare precision adaption with regular covariance adaptation for this example when we 
increase n. 

Since both the prior and the likelihood are Gaussian, the posterior density of x|y can be 
shown to be Gaussian with covariance matrix Xl = (Q -h ^ A^A) . Thus, this covariance 
matrix is the optimal scaling matrix for the MALA. As a measure of how good another proposal 
matrix Tip is, we use 

E-=1 A. 


b = n 


where {A^} are the eigenvalues of 


-1 


The optimal value of b is one and the larger value 


the worse proposal matrix, see Roberts and Rosenthal (2001) for details of this measure. 

Figure [^displays the convergence of b in AMCMC simulations of with n — 10^, 20^, 30^, 
and 40^ basis functions in the FEM discretization. Although 40^ is a rather small dimension for 
a Gaussian random field, it is quite large for a black box adaptation method. In the figure, one 
can clearly see that the precision adaptation converges, at least initially, much faster towards 
a reasonable proposal matrix compared with the covariance adaptation method. Further, 
from Figure one sees that both sampling the MALA and computing the Cholesky factor 
the precision adaption scales much better with size of the dimension compared to the regular 
covariance adaption. 

In conclusion, we see that the precision adaption outperforms the regular covariance adap¬ 
tion by a large margin, although neither method is great without a good initial guess of the 
covariance matrix. 


4.2 Adaptive spline smoothing 

In this example we compare the adaption methods on a Bayesian model with a sparse condi¬ 
tional dependency structure. The example is an adaptive smoothing spline model with varying 
measurement error, and stochastic differential equations are again used to define the model. 
Let X (t) be a twice differentiable second order random walk, defined as the solution to 

TxAX{t) = Wiit), 

where Wi is Gaussian white noise. As in the previous example, we measure X under Gaussian 
noise, resulting in observations Y{ti) = X{ti) + Si. The difference here is that we assume that 
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Figure 2: The factor b defined in equation which is a measure of how good the proposal 
is, for various discretization sizes. The optimal value of 6 is 1. The dashed line is the preci¬ 
sion adaptation and the solid line is regular covariance adaptation. Note that the number of 
iterations varies between the plots at the bottom panels and the top panels. 


the measurement noise 6: has a varying variance, V{£i) — where V{ti) is a second order 

random walk: 


TvAV{t)=W2{t). 


Here W 2 {t) again is Gaussian white noise. As in the previous example, the differential equations 
are discretized using a FEM approach, described in detail by Lindgren and Rue (2008). The 
resulting joint posterior distribution is 


log7r(x,v, cry, ally) oc - i (y - Aa;x)'^D(e (y - Aa;x) 

n 

E 'Tx TrA TrA 

Vi - yX Q^x - y V Q^V 

i=l 

Ti Ti 

+ 2 + 2 C^) 


a; and Ay are as in 


Here D (v) is a diagonal matrix with v on the diagonal. The matrices A 
the previous example observation matrices that link the observation locations to the random 
weights X and v for the FEM discretizations. The matrices and are sparse tridiagonal 
matrices, given in explicit form in (Lindgren and Rue, 2008). Einally, the last terms in 0 


come from assuming exponential priors on the precision parameters. 


We apply the model to a classical data set of motor-cycle crashes, analysed by Silverman 


(1985). The observations are accelerometer readings taken through time in simulated crashes 
used to test crash helmets. Eigure displays the data points and it is clear that the variance 
is not constant over time. 

We use 250 piecewise linear basis function for both x and v in the basis expansion 0. In 
Eigure|^ the solid line is the posterior mean of the spline function X{t) and the dashed lines 
show the posterior means of X{t) ± 
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Figure 3: The solid line shows average sampling time (in ms) of using covariance based scaling 
matrix as a function of the dimension of x. The dashed line shows the average sampling time 
using precision based scaling matrix. The solid circled line shows the average time of updating 
the Cholesky factor of the covariance matrix. The dashed circled line shows the average time 
of updating the Cholesky factor of the precision matrix. 


To compare the adaption for MHRW and MALA, we study the convergence of the samples 
for X 21 (which corresponds to X(6.8)) and r^. One expects the parameter to be more difficult 
to sample than X 21 since it is higher up in the hierarchical structure. 

In Figure]^ we display trace plots of ten million samples of log(T^) for all adaption meth¬ 
ods. It is apparent from the figure that the adapataions for MHRW did not converge before 
the algorithm was stopped, and it is not clear if the MALA method with covariance adapta¬ 
tion converged. Finally, MALA using precision adaptation converges to a steady state after 
approximately 2.5 million samples. The results for X 21 in Figure tells a similar story. 

Table displays the average cost of one iteration for each of the methods. Here it is 
worth pointing out that precision adapted MALA has a lower computational cost than regular 
MHRW with covariance adaption. 

The numbers should be put in relation with the number of basis functions used in the FEM 
approximations. The relative improvement of precision adaption compared with covariance 
adaption would increase if one were to use more basis functions, and it would decrease if fewer 
basis function were used. 

In conclusion, from the above result it is clear that one should use MALA with precision 
adaptation for this example. 


_MHRW MALA 

covariance adaption 0.67 1.32 

precision adaption 0.49 0.59 


Table 1: Average time per iteration in ms for the different sampling-adapation schemes for the 
second example. 


5 Discussion 

In this article we have derived an adaptive MCMC algorithm that online estimates the Cholesky 
factor of the precision matrix of a density using a least squares method. The method thus 
only assumes existence of the second moment of the target distribution and does not rely on 
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Figure 4: Data and results for example 2. The black dots are measurements, the solid line is 
the posterior mean of X(t), and the dashed lines are the posterior mean of X(t) ± 


any Gaussianity assumptions. We have shown that by taking advantage of the dependency 
structure of the target density the algorithm can generate an efficient sampling method. The 
updating of each row in the Cholesky factor is done independently and the method is thus easy 
to parallelize. Furthermore, the method could be extended so that one updates the Cholesky 
factor from several independent MCMC trajectories, similarly to the methods by |Solonen et al.| 

An interesting research direction is to try to formulate the regression problem for con¬ 
structing the Cholesky factor such that the formulation varies over the parameter space. For 
instance, in the first example we used a fixed parameter of the measurement error, a, which in 
most applications instead would be a random variable. Since the smaller a is relative to Q, the 
more the posterior distribution of the random field x is concentrated around the observations. 
Thus a proposal for x should vary scale with a. Which could be constructed by allowing the 
elements of D to vary with a. 
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A Convergence of precision adaption 

From the arguments by [Roberts and Rosenthal] ( |2009[ ) , we need to ensure two conditions to get 
theoretical justification of using a AMCMC algorithm. The two conditions are Diminsihing 
Adapation and Bounded convergence. 

The precision adaptation method satisfies Diminsihing Adapation since the elements up¬ 
dated in Algorithm 1^ only change with 0{j) at the iteration. Bounded convergence is a 
little more subtle to show and there exists various method to ensure the condition. In the 
following proposition we link precision adaption with covariance adaption. 

Proposition A.l. LetS be the set of positive definite matrices Then the L-UPDATE 

function in Algorithm\^ is a continuous function from S to the set of invertible lower triangular 
matrices. 
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(b) 



X 10® 


(d) 



X 10® 


Figure 5: The panels show trace plots of log(r^). The samples are taken from runs of ten 
million MCMC iteration where every fiftieth sample is stored. The methods shown are MHRW 
with covariance adaptation (Panel a), MHRW with precision adaption (Panel b), MALA with 
covariance adaption (Panel c), and MALA with precision adaption (Panel d). 


Proof. To show that L-UPDATE is a continuous function, one only need to show that the 
operations in line 7 and 8 are continuous, which follows trivially if '^[AiUi),{AiUi) is invertible 
That is invertible follows from the Poincare separation theorem (Magnus and 


Neudecker, 1988). 


To show that L-UPDATE produces an invertible matrix it is enough to show that Djj > 0 


(row 8) for j = 1, • • •, N. 
the matrix S(AiU),(AiU)> 


This follows from that D 
which is positive definite. 


jj is the Schuur complement of in 

□ ’ □ 


Using Proposition A.l| it is enough to show that the empirical covariance matrix is in a 


compact domain of positive definite matrices. So for example, if equation (14) in (Haario et al. 


2001) is satisfied, then the precision adaptation satisfies Diminsihing Adapation. 
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